Abstract

Inspired by cognitive radio networks, we consider a setting where multiple
users share several channels, modeled as a multi-user multi-armed bandit (MAB)
problem. The characteristics of each channel are unknown and are different for
each user. Each user can choose between the channels, but her success depends
on the particular channel chosen as well as on the selections of other users:
if two users select the same channel their messages collide and none of them
manages to send any data. Our setting is fully distributed, so there is no
central control. As in many communication systems, the users cannot set up a
direct communication protocol, so information exchange must be limited to a
minimum. We develop an algorithm for learning a stable configuration for the
multi-user MAB problem. We further offer both convergence guarantees and
experiments inspired by real communication networks, including comparison to
state-of-the-art algorithms.


1 Introduction 

The inspiration for this paper comes from the world of distributed multi-user 
communication networks, such as cognitive radio networks. These networks 
consist of a set of communication channels with different characteristics, and 
independent users whose goal is to transmit over these channels as efficiently as 
possible. 

Modern networks, such as cognitive radio networks, must cope with several 
challenges. First and foremost, the networks’ distributed nature prohibits any 
form of central control. In addition, many users operate on an “ad hoc” basis, 
preventing them from forming inter-user communication. In fact, they probably 
do not even know how many users share their network. 

On top of these issues of multi-user coordination, the channel characteristics 
may be initially unknown, and differ between users. Thus, learning must be 
integrated into the solution. 

1.1 Cognitive radio networks

Cognitive Radio Networks (CRNs), introduced in [^, have attracted considerable
attention in recent years. The idea that lies at the heart of CRNs is that
advanced sensing mechanisms and increased computation power may enable
radio devices to dramatically improve their performance in terms of resource
utilization, resilience and more. Networks of such users are usually dynamic
and stochastic, giving rise to many interesting problems [^[^. We focus on
developing a sensing and transmission scheme that enables users to learn a
stable, orthogonal configuration without communicating directly.

1.2 Multi-armed bandits 

A well known framework for learning in CRNs is the classical Multi-Armed 
Bandit (MAB) model. MABs offer a simple, intuitive framework for learning 
the characteristics of a number of unknown options in an online manner, while 
balancing exploration and exploitation. A MAB problem consists of a single 
user repeatedly choosing between arms with different characteristics, that are 
initially unknown. After every round, the user acquires a reward that depends 
on the arm she chose. Her goal in most setups is to maximize the expected sum 
of rewards acquired over time. 

As suggested in [^, the channels of a CRN are naturally cast as the arms 
of a bandit, with different performance measures (bandwidth, ACK signals, bit 
rate) serving as the reward. 

Many papers propose solutions for the stochastic MAB problem (see, e.g.,
dll) and its adversarial version (see, e.g., H), but they all assume a single
user is sampling the arms of the bandit.

However, this assumption does not apply in multi-user networks. In the
multi-user MAB model, users compete over the arms of the same bandit. As
a result, they are bound to experience collisions (i.e., multiple users sampling
the same arm), unless they employ some form of collision avoidance or
coordination mechanism. Collisions in communication networks result in
performance degradation, corresponding to reward loss in the MAB model. In
order to avoid reward loss, the presence of multiple users must be addressed.
We survey several approaches to this issue in Section 1.4.

1.3 Extension of the CRN-MAB setting 

The novelty introduced in our paper lies in the combination of bandit learning,
multiple users, different reward distributions for different users and no direct
communication. The combination of the last two requirements, different
distributions and no direct communication, poses a real challenge.

As explained in detail in Section 2.3 and in Section 2.4, the only thing we can
guarantee in terms of network behavior in this setup is stability. In a dynamic, 
distributed network, stability is of great value. Once a network has reached 
a stable configuration, users can focus on utilizing its resources, rather than 
engaging in coordination or learning efforts; a stable network is more robust
and efficient. 

Reaching stability is a nontrivial task, since users must learn their channel 
characteristics while coordinating their actions with the other users, based on 
very limited observations. 


1.4 Previous work 


We now present several approaches to the CRN-MAB problem, coming from 
different areas and disciplines. 

Our problem may be viewed as an assignment problem, i.e., maximum weight
matching in a weighted bipartite graph. Users correspond to agents, channels
to tasks, and rewards are simply the complement of the costs of graph edges.
Several papers have been published on the distributed assignment problem, but
to the best of our knowledge none of them offers a solution for our problem.

The well-known Hungarian method requires full knowledge of the graph
(i.e., channel characteristics) and assumes the existence of central control. The
Bertsekas auction algorithm frees us from the need for central control, at
the cost of direct communication between nodes. The classical Gale-Shapley
algorithm [11] solves the problem of finding a stable marriage configuration,
but does not take the need to learn into account. Some papers have actually
applied it to CRNs, but not in the learning context [12, 13]. Another work on
distributed stable marriage, which makes use of a variant of the Gale-Shapley
algorithm, is [14]. While it is quite foreign to our problem, the potential
function defined in the paper is helpful in our analysis. Another noteworthy
work in this context is [15]. The authors address the challenge of limiting
communication between nodes to a minimum, and propose two communication
models. Nevertheless, they allow more communication than we would like, and
their formulation does not consider learning. Two additional results that deal
with distributed stable marriage offer lower bounds and state that some form of
information exchange is inevitable when solving such problems [16, 17].

The papers closest to ours in spirit are those dealing with multi-user MABs.
There has been work on the case of reward distributions that do not vary
between users, such as [18] and [19]. The latter introduces an algorithm that
is able to cope with a variable number of users. Another paper, which addresses
different reward distributions for different users, is [20]. Here, the authors
employ the Bertsekas auction algorithm. This approach enables users to reach a
reward-maximizing solution, at the price of direct, frequent communication
between themselves. We further elaborate on the difference between our approach
and the approach of [20] in Section 6.

To this end, we would like to point out that communication between users is 
undesirable not only because of its price in terms of network resources and time. 
Once users depend on communication, they are more vulnerable to intentional 
attacks that may disrupt it, as well as noise bursts that are common in CRNs.

2 Model and formulation

We now describe the model, the assumptions accompanying it and our goal. 

2.1 System and users 

We model a communication network with K channels, servicing N independent 
users. Our work is based on the assumption that $K \ge N$, which is reasonable
since without it, implementing a time division based mechanism is necessary.
Once such a mechanism is applied, the assumption that $K \ge N$ is valid again.
Time is slotted and users’ clocks are synchronized, also a mild assumption for 
modern communication systems. 

The communication network consists of K channels, where only one user 
can transmit over a certain channel during a single time slot. Each transmission 
yields a reward, which we assume to be stochastic. 

The users are a group of N independent, selfish agents. Their observations 
are local, consisting only of the history of their actions and rewards. In addition, 
they do not know the number of users they share a network with. There is no 
central control managing their use of the network, and they do not have direct 
communication with each other. 

A key characteristic of our model is that the expected reward a channel
yields depends not only on the identity of the channel, but also on the identity
of the user. Formally, the rewards of the channels are Bernoulli random variables
with expected values $\{\mu_{n,k}\}$, where $n \in \{1, \dots, N\}$ and
$k \in \{1, \dots, K\}$. This property reflects the fact that in real life users
may experience location-based disturbances, manifested in different reward
distributions for the same channel.

We model the users' sharing of resources through the representation of the
communication network by a single bandit. This means that two users attempting
to access the same channel at the same time will experience a collision. In our
model, the result of a collision is complete loss of communication for that time
slot for the colliding users, i.e., zero reward. A user n that accesses a channel
k alone during a certain time slot will receive a reward drawn i.i.d. from a
Bernoulli distribution with expected value $\mu_{n,k}$. Throughout the paper, we
use the term configuration to refer to a mapping of users to channels.
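
As an illustration of the collision model above, the following Python sketch draws the rewards of all users for a single time slot. The function name `slot_rewards` and the list-of-lists layout of the means are assumptions made for this example only; they are not part of the model description.

```python
import random


def slot_rewards(mu, actions, rng=random):
    """Return one reward per user for a single time slot.

    mu[n][k]   -- expected (Bernoulli) reward of user n on channel k
    actions[n] -- channel chosen by user n in this slot
    """
    # Count how many users picked each channel, to detect collisions.
    load = {}
    for k in actions:
        load[k] = load.get(k, 0) + 1

    rewards = []
    for n, k in enumerate(actions):
        if load[k] > 1:
            rewards.append(0)  # collision: every colliding user gets zero
        else:
            # Lone user on the channel: Bernoulli draw with mean mu[n][k].
            rewards.append(1 if rng.random() < mu[n][k] else 0)
    return rewards
```

With deterministic means (0 or 1) the behaviour can be checked by hand: two users on the same channel both receive 0, regardless of how good the channel is for either of them.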

2.2 Limited coordination 

In an effort to keep our model faithful to real world CRNs, we limit the
coordination between users to a minimum. Thus, users can only transmit in a
channel of their choice, or sense the spectrum range and receive binary feedback
regarding all channels $\{1, \dots, K\}$ at time t. A “0” represents no
transmission in a channel, while “1” stands for the opposite.

2.3 Reward maximizing solution

We adopt a system-wide view for characterizing the optimal solution. The
optimal configuration must be orthogonal (i.e., no more than one user per
channel), in order to avoid collisions and the resulting reward loss. One common
approach seeks to maximize the sum of rewards over all users, over time. The
assignment of users to channels is chosen accordingly:

$$R = \max_{\pi \in C} \sum_{n=1}^{N} \mu_{n,\pi(n)},$$

where C is the set of all possible permutations of subsets of size N chosen without
replacement from the set $\{1, \dots, K\}$.
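
To make the size of this search concrete, the sketch below enumerates every injective user-to-channel map and keeps the one with the largest sum of means. It is a centralized brute-force illustration that assumes full knowledge of the means, which is exactly what the users in our setting lack; the name `best_assignment` is hypothetical.

```python
from itertools import permutations


def best_assignment(mu):
    """Exhaustively search injective user-to-channel maps for the largest total mean.

    mu[n][k] is the expected reward of user n on channel k.
    Returns (pi, value), where pi[n] is the channel assigned to user n.
    """
    n_users, n_channels = len(mu), len(mu[0])
    best_value, best_pi = float("-inf"), None
    # permutations(range(K), N) yields ordered selections of N distinct channels.
    for pi in permutations(range(n_channels), n_users):
        value = sum(mu[n][pi[n]] for n in range(n_users))
        if value > best_value:
            best_value, best_pi = value, pi
    return best_pi, best_value
```

For two users and three channels with means `[[0.9, 0.8, 0.1], [0.9, 0.2, 0.1]]`, both users like channel 0 best, and the sum is maximized when the first user steps down to channel 1. The number of candidate maps grows as K!/(K-N)!, so this is only practical for very small systems.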

However, reaching such a solution requires frequent information exchange.
Assume channel k is optimal for two different users m and n, but
$\mu_{m,k} > \mu_{n,k}$. To maximize the system-wide reward, user n must step
down and choose a different channel. The lack of central control requires
explicit information exchange regarding the values of $\mu_{m,k}$ and
$\mu_{n,k}$, for m and n to decide which of them should step down. Since the
reward estimates are updated as time goes by, such preferences must be
communicated repeatedly.

Due to limited information exchange, a reward-maximizing solution cannot 
be guaranteed in our setup. We therefore focus on convergence to a stable, 
orthogonal configuration. 

2.4 Stable marriage solution 

Our goal is to develop policies that will lead users to a stable configuration. We 
employ the notion of stable marriage to formally define stability: 

Definition 1. A Stable Marriage Configuration (SMC) is an assignment of
users to channels such that no two users would be willing to swap channels, had
they known the true values of the expected rewards. Formally, for a pair of users
$n, m$:

$$S_1 = \{\mu_{n,a_n} < \mu_{n,a_m}\} \quad \text{user } n \text{ would like to swap,}$$

$$S_2 = \{\mu_{m,a_m} \le \mu_{m,a_n}\} \quad \text{user } m \text{ is willing to swap,}$$

where $a_m$ and $a_n$ are the users' current actions. In an SMC,

$$S_1 \cap S_2 = \emptyset \quad \forall n, m.$$
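
The pairwise condition of Definition 1 can be checked directly when the true means are available. The sketch below is a minimal illustration under two assumptions: the initiating user must strictly gain from the swap, while the responding user only needs not to lose (a non-strict inequality). The function name `is_smc` is hypothetical.

```python
def is_smc(mu, actions):
    """Check the pairwise swap condition of Definition 1 on the true means.

    mu[n][k]   -- expected reward of user n on channel k
    actions[n] -- channel currently held by user n (assumed orthogonal)
    """
    n_users = len(actions)
    for n in range(n_users):
        for m in range(n_users):
            if n == m:
                continue
            # User n would like to swap: m's channel is strictly better for her.
            wants = mu[n][actions[n]] < mu[n][actions[m]]
            # User m is willing to swap: n's channel is at least as good for her
            # (non-strict inequality is an assumption of this sketch).
            willing = mu[m][actions[m]] <= mu[m][actions[n]]
            if wants and willing:
                return False
    return True
```

For example, with `mu = [[0.9, 0.1], [0.2, 0.8]]` the configuration `[1, 0]` is not stable, since both users gain by swapping, while `[0, 1]` is.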


2.5 Goal 

Given a system with K channels and N users, allowing only limited
communication as described in Section 2.2, our goal is to reach a configuration
that is both orthogonal (no two users use the same channel) and an SMC,
according to Definition 1.

Figure 2: Selection of initiator


3 Coordination protocol 

Our coordination protocol balances the limitations of Section 2.2 with the users'
need for information exchange by introducing a signalling mechanism between
pairs of users. At predefined time slots, a user wishing to occupy a channel may
transmit in that channel to express her wish. In order to ensure that this signal
is received by the user currently occupying the channel, we employ a
frame-based protocol. We assume users can transmit and sense at the same time, a
reasonable requirement in modern communication systems.

The following explanation is best understood by observing Figure 1. Our
protocol divides time into super frames of length $T_{SF} = 2 + 2(K - 1)$. Each
super frame begins with a pair of time slots, $S_1$ and $S_2$, during which a
single signalling user, the initiator, is coordinated for the entire super frame.
The procedure is described in the algorithm of Figure 4 and in Figure 2. Next
come $K - 1$ mini-frames of two time slots each, denoted by $S_3$ and $S_4$.
Each of these mini-frames corresponds to one channel on the initiator's list of
preferred channels. Thus, a single super frame enables one user to go over her
entire preference list and signal other users, suggesting they swap channels
with her, as explained in Figure 3. The time slots marked $S_4$ allow users not
participating in the coordinating process during a certain mini-frame to sample
their current channel and proceed with the learning-while-transmitting process.
Thus, all but two users (initiator and responder) gather a sample during each
mini-frame, resulting in at least $K - 2$ samples for each of the users, except
for the initiator, over each super frame.
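
The frame structure described above can be written down as a short Python sketch that lists the slot labels of one super frame. The function names and the 0-based slot index are assumptions of this example.

```python
def super_frame_layout(num_channels):
    """Slot labels of one super frame: S1, S2, then K-1 mini-frames of (S3, S4)."""
    slots = ["S1", "S2"]  # initiator selection
    for _ in range(num_channels - 1):
        slots += ["S3", "S4"]  # one mini-frame per channel on the preference list
    return slots


def slot_role(t, num_channels):
    """Label of time slot t, counting slots from 0 at the start of a super frame."""
    layout = super_frame_layout(num_channels)
    return layout[t % len(layout)]
```

For K = 12 channels the layout has 2 + 2 * 11 = 24 slots, of which 11 are `S4` slots available for sampling by users not involved in the current proposal.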

While this may seem like much coordination, the protocol is very simple
to implement, and is indeed lightweight when compared to other protocols, as
further explained in Section 5 and in Section 6.

[Flowchart. Initiator: desired channel; sense desired channel; rejected:
proceed to next best channel; accepted: swap. Responder: sense own channel;
approached by initiator; agree to swap?; transmit in own channel; transmit
and learn.]

Figure 3: Initiator-responder dynamics


4 The CSM-MAB algorithm 

We now turn to a full description of our algorithm, the Coordinated Stable
Marriage Multi-Armed Bandit (CSM-MAB) algorithm. We propose a user-level
algorithm for a fully distributed system, whose goal is described in Section
2.5. When all users in the network apply CSM-MAB, the assignment of users
to channels is guaranteed to be orthogonal, and converges to an SMC.

Our algorithm begins with a start-up phase, during which users transmit and
sense to detect collisions, in order to reach an initial orthogonal configuration
(line 1). This phase follows the lines of the CFL algorithm introduced in [21],
and converges quickly. Once an initial orthogonal configuration has been reached,
users start executing the CSM-MAB algorithm, described in Figure 4.

At the beginning of each super frame, users execute the rank_channels
procedure to individually create a list of channels they prefer over their current
action (line 4). Channels are assigned values according to their UCB indices,
calculated using the well known formula from [^:

$$I_{n,k}(t) = \hat{\mu}_{n,k} + \sqrt{\frac{2 \ln t}{s_{n,k}}}, \qquad (1)$$

where $\hat{\mu}_{n,k}$ is the empirical mean of the reward acquired by user n on
channel k up till time t and $s_{n,k}$ is the number of times she sampled arm k
up till time t.
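
A minimal Python sketch of the ranking step follows, assuming the standard UCB1 form of the index (empirical mean plus a square-root exploration bonus with constant 2). The signature of `rank_channels` used here, including the explicit time argument `t`, and the treatment of unsampled channels are assumptions of this example rather than a specification of the procedure in Figure 4.

```python
import math


def ucb_index(mean, count, t):
    """UCB1-style index: empirical mean plus an exploration bonus."""
    if count == 0:
        return float("inf")  # assumption: unsampled channels are ranked first
    return mean + math.sqrt(2.0 * math.log(t) / count)


def rank_channels(current, means, counts, t):
    """Channels whose index exceeds that of the current channel, best first."""
    idx = [ucb_index(m, s, t) for m, s in zip(means, counts)]
    better = [k for k in range(len(idx)) if k != current and idx[k] > idx[current]]
    return sorted(better, key=lambda k: idx[k], reverse=True)
```

When all channels have been sampled equally often the bonus is identical for every channel, so the ranking reduces to a comparison of empirical means; the bonus matters when sample counts differ.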

Next, the users coordinate an initiator according to the scheme in Figure 2.
Every user who would like to improve upon her current channel presents herself
as the initiator with probability $\epsilon$ (lines 5-11). An agreed initiator
for the super frame (SF) emerges if and only if exactly one user raises her flag
(the value of $\epsilon$ is chosen in order to maximize the probability of this
occurring). Once a single initiator is agreed upon, all users take note of her
current channel, based on

 1: $a_n(0)$ ← apply_CFL(K)
 2: for all frames t do
 3:   if mod(t, $T_{SF}$) == 1 then {Beginning of SF}
 4:     list ← rank_channels($a_n(t-1)$, $\hat{\mu}_n$, $s_n$)
 5:     if list ≠ ∅ then {User seeks to change channel}
 6:       $flag_n$ ← rand(Bernoulli, $\epsilon$)
 7:       if ($flag_n$ == 1) ∧ ($flag_i$ == 0 ∀ i ≠ n) then
 8:         initiator = n {User n is initiator for this SF}
 9:         pref = 1 {Initialize swapping preference to 1}
10:       end if
11:     end if
12:   else
13:     if (initiator == n) ∧ (pref > 0) then {n is the initiator, list not exhausted yet}
14:       response ← propose_swap(list(pref))
15:       if response == 1 then {Responder agreed or channel is available}
16:         $a_n(t)$ ← swap($a_n(t)$, list(pref))
17:         pref ← 0
18:       else
19:         pref ← pref + 1 {Move to next best channel}
20:       end if
21:     end if
22:   end if
23:   $r_n(t)$ ← execute_action($a_n(t)$)
24:   update_stats($r_n(t)$, $\hat{\mu}_{n,a_n(t)}$, $s_{n,a_n(t)}$)
25: end for

note: $\hat{\mu}_{n,k}$ is the empirical mean of the reward for user n on arm k;
$s_{n,k}$ is the number of times she has sampled it.

Figure 4: The CSM-MAB algorithm





their sensing. They will need this knowledge to decide whether to accept her 
swapping suggestion. 
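
The role of the flag probability can be illustrated numerically. If m users independently raise a flag with the same probability, the chance that exactly one flag is raised is m times that probability times the chance that the other m - 1 stay silent, and this expression is maximized when the probability equals 1/m. The sketch below computes it and locates the maximizer on a grid; the names are hypothetical, and the number of flagging users m is a parameter of the example, since users in our setting do not know it.

```python
def prob_single_initiator(num_candidates, eps):
    """Probability that exactly one of num_candidates independent Bernoulli(eps) flags is 1."""
    return num_candidates * eps * (1.0 - eps) ** (num_candidates - 1)


def best_flag_probability(num_candidates, grid=1000):
    """Grid search for the eps in (0, 1) that maximizes prob_single_initiator."""
    candidates = [i / grid for i in range(1, grid)]
    return max(candidates, key=lambda e: prob_single_initiator(num_candidates, e))
```

For four flagging users the grid search returns 0.25, in agreement with the 1/m rule, and the resulting success probability is 4 * 0.25 * 0.75^3, roughly 0.42 per super frame.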

The initiator proceeds to signal other users, based on her ranking of channels
(lines 13-21). Signalling is implemented in propose_swap by transmitting in
the initiator's channel of interest. Each responder (i.e., signalled user) checks
whether swapping channels with the initiator will improve her situation, based
on her own ranking. Once a responder agrees, a swap takes place. No more
signalling attempts are made till the end of the SF, and users simply continue
sampling their chosen channels. If the responder refuses, the initiator will
approach the next-best channel on her list. She will continue the process until
she (a) finds a partner that agrees to swap; or (b) exhausts her list of potential
swaps. This part of the algorithm is depicted in Figure 3.
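
The proposal loop of a single super frame can be sketched as follows. This is an idealized, centralized simulation of the exchange, not the over-the-air signalling itself: the `accepts` callback stands in for the responder's decision, and all names are hypothetical.

```python
def run_proposals(initiator, prefs, actions, accepts):
    """Walk the initiator's preference list until a swap succeeds or the list ends.

    prefs   -- channels the initiator prefers over her own, best first
    actions -- list mapping user -> channel (orthogonal); modified in place
    accepts -- hypothetical callback accepts(responder, offered_channel) -> bool
    Returns the channel obtained, or None if every proposal was rejected.
    """
    own = actions[initiator]
    for target in prefs:
        holders = [u for u, ch in enumerate(actions) if ch == target]
        if not holders:
            # Channel is available: the initiator simply moves there.
            actions[initiator] = target
            return target
        responder = holders[0]
        if accepts(responder, own):
            # Responder takes the initiator's old channel; no further signalling.
            actions[initiator], actions[responder] = target, own
            return target
    return None
```

In a fuller simulation, `accepts` would compare the responder's own index for the offered channel with the index of her current channel.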


5 Analysis 


We will now show that the CSM-MAB meets the goals defined in Section 2.5.
Our main theoretical result is stated in Theorem 1.


Theorem 1. Consider a system with K channels and N users, with channel
rewards characterized by the matrix $\mu$. Applying CSM-MAB (Figure 4) by
all users will result in convergence to an orthogonal SMC: For all $\delta > 0$
there exists $T(\delta)$ such that for all time slots $t > T$, the probability of
the system's being in an SMC is at least $1 - \delta$.

The proof of Theorem 1 consists of two aspects: orthogonality and stability.
The first part is easy to verify.


Proposition 1. The actions of users applying CSM-MAB are orthogonal (i.e.,
there is at most one user sampling each channel) for all $t > t_0$ with
probability of at least $1 - \delta_0$.


Proof. Based on Theorem 1 of [21], the initial configuration reached after
running the CFL algorithm is orthogonal with probability 1. The authors provide
an upper bound on the distribution of stopping times, $\tau$:

$$P[\tau > k] \le \alpha e^{-\gamma k},$$

where $\alpha$ and $\gamma$ are some positive constants. The expected stopping
time is therefore upper bounded by a constant that depends on $\alpha$ and
$\gamma$. Thus, for a suitable choice of $t_0$, the probability of not having
reached an orthogonal configuration by time $t_0$ is at most $\delta_0$, which
decays exponentially in $t_0$. Once the system reaches an orthogonal
configuration, a user does not switch to an occupied channel without having
coordinated the switch, as defined in the algorithm of Figure 4. □


5.1 Stability and potential 

Showing that our system converges to a stable solution is more involved. We
begin by defining a potential function for the problem. For any user
$n \in \{1, \dots, N\}$, the potential at time t is defined as follows:

$$\phi_n(t) = \sum_{k=1}^{K} \mathbb{1}\left\{\mu_{n,k} > \mu_{n,a_n(t-1)}\right\}, \qquad (2)$$

where $a_n(t-1)$ is the action taken by user n in the previous time step. In
words, the potential is the number of channels user n would prefer over her
current choice, had she known their true reward distributions. The system-wide
potential is the sum of potentials over all users:

$$\Phi(t) = \sum_{n=1}^{N} \phi_n(t). \qquad (3)$$

An illustration of the potential appears in Tables 1 and 2.


Table 1: Users' channel rankings (first row represents best channel, last
row represents worst). An asterisk marks each user's current choice.

Rank | U1 | U2 | U3
1    | 1  | 2  | 4*
2    | 2  | 1* | 1
3    | 4  | 3  | 2
4    | 3* | 4  | 3

Table 2: User potentials corresponding to the configuration in Table 1.

$\phi_1$ | $\phi_2$ | $\phi_3$
3        | 1        | 0
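
The potentials in Table 2 can be reproduced with a few lines of Python. The rankings below are those of Table 1; the current channels (3, 1 and 4 for users 1, 2 and 3) are an assumption of this example, chosen as the unique choices consistent with the potentials 3, 1 and 0. The function names are hypothetical.

```python
def potential_from_ranking(ranking, current):
    """Number of channels ranked strictly above `current` in a best-first list."""
    return ranking.index(current)


def potential_from_means(mu_n, current):
    """Count channels whose true mean exceeds that of the current channel."""
    return sum(1 for m in mu_n if m > mu_n[current])


def system_potential(rankings, actions):
    """Sum of the individual potentials over all users."""
    return sum(potential_from_ranking(r, a) for r, a in zip(rankings, actions))
```

With rankings `[[1, 2, 4, 3], [2, 1, 3, 4], [4, 1, 2, 3]]` and current channels `[3, 1, 4]`, the individual potentials are 3, 1 and 0, and the system-wide potential is 4.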


In terms of potential, a configuration is an SMC if no two users can swap 
channels and decrease their potential by doing so. We note that a stable con¬ 
figuration does not necessarily correspond to zero system-wide potential, since 
not all users might be able to achieve zero potential simultaneously, depending 
on network parameters. Also, a system may have several stable configurations, 
each characterized by a different potential. Nevertheless, observing a system’s 
potential does provide an indication regarding stability: once a system reaches 
a stable configuration, its potential will no longer change. 

We prove convergence to an SMC by using the potential function, considering 
three aspects: 

1. The maximal potential of a system with K channels and N users is finite 
and equal to N {K — 1). 

2. The potential $\Phi(t)$ is monotonically non-increasing with high probability.

3. Until an SMC is reached, changes in potential are bound to happen within
finite time.

We formalize and prove these statements in the sequel. 

Since users’ decisions are guided by UCB indices, while stability is examined 
with respect to true reward distributions, users do not always update their choice 
of channels in a way that matches the ground truth. Thus, the system potential 
may occasionally increase, due to users’ exploration or inaccurate statistics. 
In our proof we show that despite this, users ultimately converge to a stable 
configuration. 


5.2 Proof of Theorem 1

We begin by ensuring the monotonicity of the potential. 

Lemma 1. For all times t for which t > Int, if a change in potential 
occurs, it is a decrease, with probability of at least 1 — 2 t~^. 

$\Delta_{\min}$ is a distribution dependent constant. In the appendix we derive an
upper bound on the minimal time for which the condition above holds: 


M - 1 - J{M -if -AM 

imin < --, (4) 

where M = This bound will enable us to use tmin in the proof. 

Next, we introduce a lemma that concerns the ability of a single user to 
reach the position of the initiator. 

Lemma 2. If $\phi_n(t) > 0$ for some user n, then her probability of becoming the
next initiator is at least $\epsilon (1 - \epsilon)^{N-1}$.

Using Lemma 2, we show another result:

Lemma 3. If the system is not in an SMC at some time t, then a change in
the potential will occur within no more than $t'(\delta_1)$ time slots with
probability of at least $1 - \delta_1$.


The exact dependency of $t'$ on $\delta_1$ appears in the appendix, as do the
proofs of all lemmas.

The probability of the system's reaching an SMC within $\tau = t' N (K - 1)$
time slots after time $t_{\min}$ is at least


I^SMC 


A 


(1 - 5i) (I - 


N{K-1) 


We model the convergence to an SMC using a Markov chain. Let $S_t$ denote the
state of the system at time t:

$$S_t = \begin{cases} 1 & \text{if in SMC,} \\ 0 & \text{else.} \end{cases}$$

The following holds for the chain's transition probability:


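$$P\left[S_{t_{\min}+\tau} = 1 \mid S_{t_{\min}} = 0\right] \ge P_{SMC},$$

and also

$$P\left[S_T = 0 \mid S_{t_{\min}} = 0\right] \le (1 - P_{SMC})^{\lfloor (T - t_{\min})/\tau \rfloor}, \quad \forall T > t_{\min} + \tau.$$

Defining $\delta = (1 - P_{SMC})^{\lfloor (T - t_{\min})/\tau \rfloor}$ completes
the proof, and inverting yields

$$T = t_{\min} + \tau \frac{\ln \delta}{\ln(1 - P_{SMC})}.$$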

Our next result quantifies the time devoted to signalling. 

Proposition 2. In every super frame, $(K - 1)(N - 2)$ learning samples are
gathered by all users combined. During this period, $4K$ signalling and sensing
actions are performed by all users combined, so the signalling to learning ratio
is

$$\frac{4K}{(K - 1)(N - 2)}.$$

Clearly, the effort the users put into coordination is most effective when 
the number of users is close to the number of channels. This is a result of the 
frames’ length being dictated by the number of channels rather than the number 
of users, in order for the user-level algorithm to be independent of the number 
of users. 
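
To see how the ratio behaves, the following sketch evaluates it for a given number of channels and users. It assumes that the count of signalling and sensing actions per super frame is 4K, as read from Proposition 2; the function name is hypothetical.

```python
def signalling_to_learning_ratio(num_channels, num_users):
    """Signalling actions (assumed 4K) divided by learning samples (K-1)(N-2)."""
    if num_users <= 2 or num_channels <= 1:
        raise ValueError("ratio is defined only for N > 2 and K > 1")
    return 4 * num_channels / ((num_channels - 1) * (num_users - 2))
```

For K = 12 channels the ratio falls from 48/11 (about 4.4) with N = 3 users to 48/88 (about 0.55) with N = 10 users, which illustrates why coordination is most effective when the number of users approaches the number of channels.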


6 Experiments 

To demonstrate the merits of our algorithm, we implement a simulation of a 
distributed multi-user communication network. The users in our network are 
synchronized, and time is slotted. 

In this network, users cannot communicate with each other directly. However,
they can sense the entire frequency range (i.e., listen to all channels). They
may also transmit over a channel of their choice, updating this choice each time
slot.

A user n transmitting over a channel k receives a binary reward, drawn i.i.d.
from a Bernoulli distribution with parameter $\mu_{n,k}$. This can be viewed as a
form of the classic binary symmetric channel. As far as the different values of
the reward parameters go, we ran experiments in two different modes:

1. random: the $\mu_{n,k}$'s are drawn uniformly and independently from the
interval [0,1].

2. real-world: users are divided into clusters, and each cluster has a preferred
group of channels. This represents a scenario in which users sharing a
cluster are geographically close, and experience an interference in part
of the frequency range. In real-world wireless communication systems,
an agent that does not belong to the network but is transmitting in its
vicinity will often cause a similar phenomenon.

Figure 5: Changes in users' choice of channels, single realization

We present results obtained in an experiment with K = 12 channels and
N = 10 users. The users are divided into two clusters. Users 1-5 belong to
one cluster, and experience an interference in the frequency range of channels 
7-12. Users 6-10, on the other hand, experience similar performance over the 
entire frequency range. Experiments last T = 120000 time slots, and results are 
averaged over 50 repetitions. 
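
A reward matrix with this cluster structure could be generated as in the sketch below. The numeric ranges used for interfered and clean channels (0 to 0.3 and 0.5 to 1) are assumptions made purely for illustration; the values used in the actual experiments are not specified here. The function name is hypothetical.

```python
import random


def clustered_means(num_users=10, num_channels=12, seed=0):
    """Draw a matrix of Bernoulli means with one interfered cluster.

    Users 1-5 (indices 0-4) see interference on channels 7-12 (indices 6-11).
    The ranges below are illustrative assumptions, not the experimental values.
    """
    rng = random.Random(seed)
    mu = []
    for n in range(num_users):
        row = []
        for k in range(num_channels):
            interfered = n < 5 and k >= 6
            low, high = (0.0, 0.3) if interfered else (0.5, 1.0)
            row.append(rng.uniform(low, high))
        mu.append(row)
    return mu
```

Users in the second cluster draw all their means from the same range, matching the description of similar performance over the entire frequency range.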

We begin by examining the cumulative number of policy changes per user
over time, plotted in Figure 5 and in Figure 6. Since our goal is stability, we
would like the number of policy changes to be small, and indeed the rate of 
changes decreases significantly over time. Another observation, demonstrated 
by the two figures, is that different users have different patterns, depending on 
the realization but more importantly on the difficulty of their problem: users 
that have small differences between channels will need more samples in order to 
tell them apart, and will therefore experience more policy changes. 

Our next result examines the convergence to different SMCs over several
repetitions of one setup. In this case, the set of SMCs consists of 305
configurations. Naturally, the size of this set depends on the number of users,
N, the number of channels, K, and also on the specific realization of the
$\mu_{n,k}$'s. Figure 7 shows that the periods of time users spend in unstable
configurations decrease as the experiment advances, and users move between
different SMCs, depending on the realization.

Figure 6: Changes in users' choice of channels, empirical average

To complement our proof, we provide a visualization of the system potential
over time, averaged over several repetitions, in Figure 8. As shown in the proof,
the potential decays on average. The shaded area around the plot represents the
variance over repetitions, which also decays over time. As explained in Section
5.1, the potential does not necessarily decay to zero, but rather to a constant
value that represents the potential of the SMC.

Our last result examines the reward acquired by users employing the CSM-MAB
algorithm. While our theoretical guarantees focus on stability, the algorithm
incorporates reward maximization implicitly by using UCB indices to
rank channels. However, as explained in Section 2.3, reaching a reward-optimal
configuration cannot be guaranteed with the limited form of communication
we allow. In Figure 9 we compare the cumulative system-wide reward of two
algorithms: our CSM-MAB and the dUCB4 algorithm, introduced in [20]. As
explained in Section 1.4, dUCB4 incorporates an auction algorithm in order to
achieve an orthogonal reward maximizing configuration.

The price of reward maximization is, clearly, communication, which our
scheme attempts to bring to a minimum. In order to implement the auction
algorithm required by dUCB4, users must have distinct IDs and knowledge of
the number of users. This rather technical requirement hinders the ability of
the algorithm to deal with a variable number of users. Our algorithm naturally
extends to a scenario in which users arrive and leave at random times, which is
quite likely in the context of CRNs. In addition, auction algorithms inherently
rely on the good will of users, and are therefore more vulnerable to malicious
agents (e.g., agents that report false high bids for attractive channels).

Figure 7: Convergence to SMC for different realizations: horizontal axis shows
time, vertical axis shows numbering of realizations. White pixels represent
unstable configurations, other colors correspond to different SMCs. As time
goes by, longer stretches of time are spent in SMCs.

Figure 8: Decay of system potential over time, averaged over 50 repetitions.
The shaded area represents variance.

Figure 9: Cumulative system-wide reward over time, for different algorithms

The results in Figure 9 demonstrate the tradeoff between communication
and reward maximization: the time dUCB4 invests in auctioning is quite
dominant. The two variants of the algorithm differ in the accuracy of the
auctioning algorithm. The “dUCB4” variant (dotted red) uses 32 bits to encode
variables, while the “dUCB4Long” variant (dashed magenta) uses 64 bits. Because
of auctioning, it takes the algorithm a long time to turn its focus to reward
maximization. In the high-accuracy case, the users exhaust all their time
auctioning. In the low-accuracy case, they only begin acquiring rewards towards
the end of the experiment. In real-world networks, with constantly changing
conditions, such a long start-up phase is difficult to overlook. For the sake of
example, let us examine an average 802.11n WLAN network, with a nominal frame
size of 2000 bits and typical bit rate of 25 megabits per second. The
$4 \cdot 10^5$ time slots it takes dUCB4 to start acquiring rewards are
translated into a period of $4 \cdot 10^5 \cdot \frac{2000}{25 \cdot 10^6} = 32$
sec. This start-up phase doubles to over one minute when 64 bit accuracy is used
for the auction algorithm. Of course, lighter schemes than the 802.11 can be
used, but these numbers clearly demonstrate the potentially crippling overhead
brought on by communication.
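
The arithmetic behind the start-up estimate is reproduced in the short sketch below. It reads the number of time slots as 4 * 10^5, the value consistent with the quoted 32 seconds, and assumes one frame is transmitted per time slot; the function name is hypothetical.

```python
def startup_seconds(num_slots, frame_bits, bit_rate_bps):
    """Duration of num_slots time slots when each slot carries one frame."""
    return num_slots * frame_bits / bit_rate_bps


# 2000-bit frames at 25 Mbit/s: each slot lasts 80 microseconds.
low_accuracy = startup_seconds(4 * 10**5, 2000, 25 * 10**6)
```

Each slot lasts 2000 / (25 * 10^6) s = 80 microseconds, so 4 * 10^5 slots take 32 seconds; doubling the number of slots doubles this to 64 seconds.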

We note that when N is strictly less than K, our algorithm often reaches the
reward optimal configuration, or a configuration very similar in reward values.
Therefore, the variance of the cumulative reward is very small. Our intuitive
explanation is that when N < K, users have a certain degree of freedom,
increasing their chances of landing in the optimal configuration.

Despite reaching a configuration that is very close to optimal in the presented 
simulations, our algorithm acquires reward at a slower rate than dUCB4, due 
to the constant ratio of coordination and exploitation. Decreasing the amount 
of time devoted to coordination may considerably increase the reward, at the 
cost of impairing the algorithm's ability to handle a variable number of users. 
We plan to address this issue in detail in the future. 


7 Discussion 


We present an extension of the multi-user MAB problem, for the case of 
different reward distributions between the users, with limited information 
exchange. Using a specialized signalling method, our algorithm enables multiple 
users to learn network characteristics and converge to an orthogonal 
configuration that is also a stable marriage. We provide a theoretical analysis 
of our algorithm's performance, based on the notion of system potential. 
Finally, we present the results of an experimental setup and examine different 
aspects of our approach's performance, including a comparison to the dUCB4 
algorithm of [20]. As explained in further detail above, the main difference 
between the algorithms is the way they strike a balance between minimizing 
communication and maximizing the reward. We argue that our algorithm is better 
suited for real-world problems. 

In the future we intend to extend our work to a dynamic scenario, both 
in terms of channel characteristics and number of users. The latter should be 
straightforward due to the minimal inter-dependency of users, while the former 
will require some adjustment of the learning algorithm. Another interesting 
variant, applicable to networks with a fixed number of users, alters the ratio 
between coordination and exploitation as time goes by, to enable better use of 
network resources. 

Appendices 

A Proof of Lemma 1 

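We would like to show that for all values of $t$ for which $t > \alpha \ln t$, 
the probability that the potential decreases every time it changes is at least 
$1 - 4t^{-4}$, where $\alpha$ is a constant that depends on $K$ and 
$\Delta_{\min}$ and is derived below. 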

Given that a change in potential occurs at time t, it is guaranteed to result 
in a potential decrease if it benefits both users. This will happen if both users’ 
indices, that guide their decisions, are accurate w.r.t the true distribution. 
Since we condition on a change in potential, 

$$P[\Phi_{\mathrm{Dec}}] = 1 - P[\Phi_{\mathrm{Inc}}].$$

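Let us upper bound $P[\Phi_{\mathrm{Inc}}]$. For a user $n$ switching from arm 
$j$ to arm $i$ at time $t$, with $\mu_{n,i} < \mu_{n,j}$, 

$$P[\Phi_{\mathrm{Inc}}] = P\left[ I_{n,i}(t) > I_{n,j}(t) \,\cap\, \mu_{n,i} < \mu_{n,j} \right],$$

where $I_{n,i}(t)$ is user $n$'s UCB index of arm $i$ at time $t$, defined in 
the main text. Following the proof of Theorem 1 of [?], 

$$P[\Phi_{\mathrm{Inc}}] = P\left[ \bar{X}_{n,i}(t) + c_{t,s_{n,i}} > \bar{X}_{n,j}(t) + c_{t,s_{n,j}} \,\cap\, \mu_{n,i} < \mu_{n,j} \right] \le 2t^{-4},$$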


provided that 

$$s_{n,i} \ge \frac{8 \ln t}{\left( \Delta^{(n)}_{i,j} \right)^2}, \qquad (5)$$

where $s_{n,i}$ is the number of times user $n$ sampled arm $i$ up to time $t$ 
and $\Delta^{(n)}_{i,j} = |\mu_{n,i} - \mu_{n,j}|$. If (5) does not hold, then 
the UCB index "misleads" user $n$, causing her to mistakenly favor arm $i$, 
despite its lower expected reward. Switching from arm $j$ to arm $i$ will 
result in an increase in potential. However, once she acquires another sample 
of arm $i$, its index will decrease. In the meantime, the index of arm $j$ will 
increase due to the passing time, and the indices will ultimately reflect the 
correct preference, resulting in a potential decrease. 

The extreme value for (5), i.e., the largest number of required samples, 
corresponds to the minimal value of $\Delta^{(n)}_{i,j}$. Let us define: 

$$\Delta_n = \min_{i \neq j} \Delta^{(n)}_{i,j}, \qquad \Delta_{\min} = \min_n \Delta_n.$$

Thus, when all arms have been sampled at least 

$$s_{\min} \triangleq \frac{8 \ln t}{\Delta_{\min}^2}$$

times, the probability of an increase in potential is very small. 
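
The behaviour of the indices described above (an additional sample lowers an arm's index, while the passage of time raises the index of an arm that is not being sampled) can be illustrated with a small sketch. It assumes the standard UCB1-style index, i.e. the empirical mean plus an exploration bonus of $\sqrt{2 \ln t / s}$; the paper's exact index is defined in the main text and may differ. The function name `ucb_index` and all numeric values are hypothetical.

```python
import math


def ucb_index(sample_mean, t, num_samples):
    """UCB1-style index (an assumption): mean plus bonus sqrt(2 ln t / s)."""
    return sample_mean + math.sqrt(2.0 * math.log(t) / num_samples)


# Arm i has few samples, so its bonus is large. One more sample shrinks the
# bonus (the empirical mean is held fixed here purely for illustration).
index_i_before = ucb_index(0.4, t=100, num_samples=5)
index_i_after = ucb_index(0.4, t=101, num_samples=6)

# Arm j is not sampled in the meantime, so its bonus grows slowly with t.
index_j_early = ucb_index(0.6, t=100, num_samples=50)
index_j_late = ucb_index(0.6, t=1000, num_samples=50)

print(index_i_before, index_i_after, index_j_early, index_j_late)
```

Under this assumed index, both effects push the comparison back toward the arm with the higher true mean, which is the mechanism the proof relies on.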

In order to allow for the coordination protocol, users do not gather 
informative samples in every time slot. Instead, they gather at least $K - 2$ 
samples in each super frame, whose length is $T_{SF} = 2 + 2(K - 1) = 2K$. 

Therefore, taking into account the fact that the sampling condition above 
must apply for all arms, the condition on $t$ is 

$$t > K \cdot \frac{T_{SF}}{K-2} \cdot s_{\min} = \frac{16K^2}{(K-2)\Delta_{\min}^2} \ln t > \frac{16K}{\Delta_{\min}^2} \ln t. \qquad (7)$$

For all times $t$ for which (7) holds, if a change in potential occurs, it is a 
decrease, with probability of at least $1 - 2t^{-4}$. 
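
To get a feel for how large $t$ must be before this condition takes effect, the following sketch searches numerically for the first time slot at which it holds. It assumes the condition reads $t > \frac{16K^2}{(K-2)\Delta_{\min}^2} \ln t$; the function names and the example values of $K$ and $\Delta_{\min}$ are hypothetical and not taken from the experiments.

```python
import math


def alpha_coefficient(K, delta_min):
    """Coefficient of ln t in the assumed condition t > alpha * ln t."""
    if K <= 2:
        raise ValueError("the bound requires K > 2")
    return 16 * K**2 / ((K - 2) * delta_min**2)


def first_time_condition_holds(K, delta_min):
    """Smallest integer t >= alpha with t > alpha * ln t.

    t - alpha * ln t is increasing for t > alpha, so the condition keeps
    holding for every later t once it holds at the returned value.
    """
    alpha = alpha_coefficient(K, delta_min)

    def holds(t):
        return t > alpha * math.log(t)

    lo = max(1, math.ceil(alpha))
    if holds(lo):
        return lo
    hi = 2 * lo
    while not holds(hi):  # grow the bracket until the condition holds
        hi *= 2
    while hi - lo > 1:  # integer bisection: holds(lo) is False, holds(hi) is True
        mid = (lo + hi) // 2
        if holds(mid):
            hi = mid
        else:
            lo = mid
    return hi


print(first_time_condition_holds(K=4, delta_min=0.5))
```

Because the threshold scales with $1/\Delta_{\min}^2$, halving the smallest gap between expected rewards roughly quadruples the coefficient, and the required $t$ grows slightly faster than that.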

When we apply this lemma we will use a quantity $t_{\min}$, an upper bound on 
the minimal $t$ for which (7) holds. Introducing a well-known lower bound on 
the logarithmic function: 

$$\ln x > \frac{x - 1}{x + 1} \quad \forall x > 1.$$

We use this lower bound together with (7): 

$$t_{\min} = \frac{16K}{\Delta_{\min}^2} \ln t_{\min} > \frac{16K}{\Delta_{\min}^2} \cdot \frac{t_{\min} - 1}{t_{\min} + 1}.$$

Denoting $M \triangleq \frac{16K}{\Delta_{\min}^2}$, we continue: 

$$t_{\min} > M \, \frac{t_{\min} - 1}{t_{\min} + 1},$$

$$t_{\min}^2 + (1 - M) \, t_{\min} + M > 0.$$

Our conclusion is that $t_{\min}$ can be defined as a closed-form expression in 
$M$ obtained from this quadratic inequality. Since this expression is finite, 
we may now use it in our proof. 

B Proof of Lemma 2 


The probability of a specific user becoming the initiator when there are 
$\ell$ interested users is 

$$P_s(\epsilon, \ell) = P[\text{specific initiator} \mid \ell \text{ interested}] = \epsilon (1 - \epsilon)^{\ell - 1} \quad \forall \ell \in \{1, \dots, N\}.$$

The probability is minimized when all $N$ users would like to become the 
initiator, yielding the bound $\epsilon (1 - \epsilon)^{N-1}$. 
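
The closed form $\epsilon(1-\epsilon)^{\ell-1}$ can be checked by brute-force enumeration. The sketch below assumes that each interested user independently attempts to initiate with probability $\epsilon$, and that a user becomes the initiator only when she is the sole user attempting; both assumptions are inferred from the form of the bound rather than taken from the protocol definition, and the function names are hypothetical.

```python
from itertools import product


def specific_initiator_prob(eps, ell):
    """Closed form: the chosen user attempts, the other ell - 1 stay silent."""
    return eps * (1 - eps) ** (ell - 1)


def specific_initiator_prob_enumerated(eps, ell):
    """Same quantity, obtained by summing over all 2**ell attempt patterns."""
    total = 0.0
    for pattern in product([True, False], repeat=ell):
        # Probability of this exact pattern under independent attempts.
        prob = 1.0
        for attempts in pattern:
            prob *= eps if attempts else (1 - eps)
        # User 0 is the sole initiator iff she is the only one attempting.
        if pattern[0] and sum(pattern) == 1:
            total += prob
    return total


for ell in range(1, 6):
    print(ell, specific_initiator_prob(0.1, ell),
          specific_initiator_prob_enumerated(0.1, ell))
```

The printed values decrease as the number of interested users grows, which is why the case in which all $N$ users are interested gives the lower bound.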

C Proof of Lemma 3 

If the system has not reached an SMC, then by the definition of an SMC the 
conditions $S_1$, $S_2$ hold for at least one pair of users $n, m$. 

According to the definition of the CSM-MAB algorithm, if $S_1$ holds, then 
user $n$ will add the channel user $m$ is sampling to her list of preferred 
channels with a probability of at least $1 - \delta$. Following arguments 
similar to those presented in the proof of Lemma 1, $\delta < 2t^{-4}$. If 
$S_2$ holds, user $m$ will accept user $n$'s swap proposal, assuming her 
statistics are accurate. This, once again, happens with a probability of at 
least $1 - \delta$. Once users $n$ and $m$ swap channels, the potential will 
change. 

In the worst case (i.e., largest $t'$), user $m$'s channel will be the last 
channel on user $n$'s list, and all users higher on the list will decline user 
$n$'s swap proposals. If user $n$ approaches a different user (whose channel is 
ranked higher than $m$'s), and that user agrees to swap, the potential will 
also change. 

What is left to prove is that the time it takes user $n$ to receive the 
privilege of being initiator is finite. Once $n$ is appointed the initiator, it 
will take no more than $K - 1$ mini-frames, i.e., $2(K - 1)$ time slots, until 
she approaches user $m$ and a swap takes place. 

There are two different cases. If $n, m$ are the only unstable pair, then 
they will be the only ones interested in becoming the initiators. Furthermore, 
if only one of them is dissatisfied, then there will only be one user 
interested in initiating. In the notation of Lemma 2, this corresponds to 
$\ell = 2$ or $\ell = 1$, respectively. The probability of exactly one of them 
becoming the initiator is $P_{1,2} = \min\{\epsilon, 2\epsilon(1 - \epsilon)\}$. 

If there are additional unstable pairs, there will be more nominees for 
initiating. However, not all super frames necessarily result in a decrease in 
potential: if the initiator only targets channels occupied by "satisfied" 
users, all her attempts will be rejected. Therefore, we need to address the 
worst case scenario, in which all $N$ users attempt to initiate, but only one 
of them is in a position that will actually result in a swap. Based on Lemma 2, 
the probability of that user emerging as the single initiator is at least 
$\epsilon(1 - \epsilon)^{N-1}$ for a single super frame. This probability is 
smaller than $P_{1,2}$ for all $\epsilon, N$, and is therefore the lower bound 
for the probability of a single initiator with actual capacity for a decrease 
in potential. 

The number of SFs in a time interval of length $t'$ is 
$C = \lfloor t' / T_{SF} \rfloor$. The probability that a single initiator with 
actual capacity for a decrease in potential does not emerge in a certain SF is 
less than $1 - \epsilon(1 - \epsilon)^{N-1}$, and the probability that a single 
initiator does not emerge in the interval is less than 
$P_c = \left(1 - \epsilon(1 - \epsilon)^{N-1}\right)^C$. As $t' \to \infty$, so 
does $C$, and $P_c$ decays to zero. 

Binding the two aspects of this lemma together, we have that the probability 
of a single initiator with actual capacity for coordinating a switch emerging 
in an interval of length $t'$ is at least $1 - P_c$. The probability of a swap 
between users whose actions do not correspond to a stable configuration is at 
least $(1 - 2t^{-4})^2$. 

The combined result: if the system is not in an SMC at time $t$, then a change 
in the potential will occur within no more than $t'$ time slots with 
probability of at least $(1 - P_c)(1 - 2t^{-4})^2$, where 
$P_c = \left(1 - \epsilon(1 - \epsilon)^{N-1}\right)^C$. 

Let us re-write the result for the sake of clarity: if the system is not in an 
SMC at time $t$, then a change in the potential will occur within no more than 
$t'(\delta_1)$ time slots with probability of at least $1 - \delta_1$. 
Developing the previous expression for the probability of a change in 
potential: 

$$\begin{aligned}
(1 - P_c)(1 - 2t^{-4})^2 &= (1 - P_c)(1 - 4t^{-4} + 4t^{-8}) \\
&> (1 - P_c)(1 - 4t^{-4}) \\
&= 1 - P_c - 4t^{-4} + 4P_c t^{-4} \\
&> 1 - P_c - 4t^{-4}.
\end{aligned}$$


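From now on, we denote $\delta_1 \triangleq P_c + 4t_{\min}^{-4}$. Using this, 
we can derive an expression for $t'(\delta_1)$ (dropping the floor in $C$): 

$$\begin{aligned}
P_c &= \delta_1 - 4t_{\min}^{-4} \\
\left(1 - \epsilon(1 - \epsilon)^{N-1}\right)^{t'/T_{SF}} &= \delta_1 - 4t_{\min}^{-4} \\
\frac{t'}{T_{SF}} \cdot \ln\left(1 - \epsilon(1 - \epsilon)^{N-1}\right) &= \ln\left(\delta_1 - 4t_{\min}^{-4}\right) \\
t' &= T_{SF} \cdot \frac{\ln\left(\delta_1 - 4t_{\min}^{-4}\right)}{\ln\left(1 - \epsilon(1 - \epsilon)^{N-1}\right)}.
\end{aligned}$$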
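
As a rough numerical illustration of this derivation, the sketch below evaluates the interval length, assuming the final expression reads $t' = T_{SF} \cdot \ln(\delta_1 - 4t_{\min}^{-4}) / \ln(1 - \epsilon(1-\epsilon)^{N-1})$. The function names and all parameter values are hypothetical and chosen only to make the example concrete.

```python
import math


def single_initiator_prob(eps, n_users):
    """Lower bound on a useful single initiator emerging in one super frame."""
    return eps * (1 - eps) ** (n_users - 1)


def interval_length(delta1, t_min, eps, n_users, t_sf):
    """Interval length t' under the assumed closed-form expression."""
    target = delta1 - 4 * t_min ** -4  # value that P_c is required to reach
    if not 0 < target < 1:
        raise ValueError("delta1 must satisfy 0 < delta1 - 4 * t_min**-4 < 1")
    p = single_initiator_prob(eps, n_users)
    return t_sf * math.log(target) / math.log(1 - p)


# Hypothetical setting: 5 users, super frames of 10 slots, eps = 0.1.
t_prime = interval_length(delta1=0.05, t_min=100, eps=0.1, n_users=5, t_sf=10)
print(t_prime)
```

A smaller $\epsilon(1-\epsilon)^{N-1}$, for instance with more users competing to initiate, makes the denominator approach zero and lengthens the interval accordingly.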